Abstract

Statistical methods for automatically identifying de- 
pendent word pairs (i.e. dependent bigrams) in a cor- 
pus of natural language text have traditionally been 
performed using asymptotic tests of significance. This 
paper suggests that Fisher's exact test is a more ap- 
propriate test due to the skewed and sparse data sam- 
ples typical of this problem. Both theoretical and 
experimental comparisons between Fisher's exact test 
and a variety of asymptotic tests (the t-test, Pearson's 
chi-square test, and Likelihood-ratio chi-square test) 
are presented. These comparisons show that Fisher's 
exact test is more reliable in identifying dependent 
word pairs. The usefulness of Fisher's exact test ex- 
tends to other problems in statistical natural language 
processing as skewed and sparse data appears to be the 
rule in natural language. The experiment presented in 
this paper was performed using PROC FREQ of the 
SAS System. 

Introduction 

Due to advances in computing power and the increas- 
ing availability of large amounts of on-line text the 
empirical study of human language has become an in- 
creasingly active area of research in both academic and 
commercial environments. 

Statistical natural language processing (NLP) re- 
lies upon studying large bodies of text called corpora. 
These methods are useful because the unaided human 
mind simply cannot notice all important linguistic fea- 
tures, let alone rank them in order of importance, when 
dealing with large amounts of text. 

The difficulty in studying human language based 
upon samples from text is that despite having billions 
of words on-line it is still difficult to collect large sam- 
ples of certain events as many linguistic events very 
rarely occur. Many features of human language ad- 
here to Zipf's Law (Zipf 1935). Informally stated, this 
law says that most events occur rarely and that a few 
very common events occur most of the time. 

Statistical NLP must inevitably deal with a large
number of rare events. Typical NLP data violates the
large sample assumptions implicit in traditional goodness
of fit tests such as Pearson's $X^2$, the likelihood
ratio $G^2$ and the t-test. When this occurs the results
obtained from these tests may be in error. An alterna- 
tive to these statistics is Fisher's exact test, which as- 
signs significance by exhaustively computing all prob- 
abilities for a contingency table with fixed marginal 
totals. 

This paper presents an experiment that compares
the effectiveness of $X^2$, $G^2$, the t-test and Fisher's exact
test in identifying dependent bigrams. When these
results are different it is shown why Fisher's exact test 
gives the most reliable significance value. All of these 
tests can be conveniently performed using the SAS Sys- 
tem (SAS Institute 1990). 

Lexical Relationships 

A common problem in NLP is the identification of 
strongly associated word pairs. A bigram is any two 
consecutive words that occur together in a text. The 
frequency with which a bigram occurs throughout a 
text says something about the relationship between the 
words that make up the bigram. A dependent bigram 
is one where the two words are related in some way 
other than what would be expected purely by chance. 
Intuitively appealing examples of dependent bigrams 
include major league, southern baptist and fine 
wine. 

The challenge in identifying dependent bigrams is 
that most bigrams are relatively rare regardless of the 
size of the text. This follows from the distributional
tendencies of individual words and bigrams as
described in Zipf's Law (Zipf 1935). Zipf found that if
the frequencies of the words in a large text are ordered
from most to least frequent, $(f_1, f_2, \ldots, f_m)$, these
frequencies roughly obey $f_i \propto 1/i$. The implications of
Zipf's law are two-sided for statistical NLP. The good 
news is that a significant proportion of a corpus is made 
up of the most frequent words; these occur frequently 
enough to collect reliable statistics on them. The bad 
news is that there will always be a large number of 
words that occur just a few times. 

As an example, in a 133,000 word subset of the
ACL/DCI Wall Street Journal corpus (Marcus et al.
1993) there are 14,319 distinct words and 73,779 distinct
bigrams. Of the distinct words, 48 percent
occur only once and 80 percent occur
five times or fewer. Of the distinct bigrams, 81 percent
occur once and 97 percent occur five times or
fewer. This data is represented graphically in Figures
1 and 2. As a result of these distributional tendencies,
data samples characterizing specific bigrams are
heavily skewed. This kind of data violates the large
sample assumptions regarding the distributional characteristics
of a data sample that are made by asymptotic
significance tests.

Figure 1: Distribution of single words

Figure 2: Distribution of bigrams
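
The kind of frequency profile quoted above can be computed with a few lines of code. The following is a minimal sketch in Python; the function name `frequency_profile` and the toy token list are illustrative assumptions and are not part of the original experiment, which used the Wall Street Journal corpus.

```python
from collections import Counter

def frequency_profile(tokens):
    """Summarize how many distinct words and bigrams occur rarely.

    Hypothetical helper for illustration; assumes a non-empty token list
    with at least two tokens.
    """
    words = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))  # consecutive word pairs

    def summarize(counts):
        distinct = len(counts)
        once = sum(1 for c in counts.values() if c == 1)
        five_or_fewer = sum(1 for c in counts.values() if c <= 5)
        return {
            "distinct": distinct,
            "share_once": once / distinct,
            "share_five_or_fewer": five_or_fewer / distinct,
        }

    return summarize(words), summarize(bigrams)

# Toy data, for illustration only (not the WSJ corpus).
toy = "the oil industry and the steel industry and the oil price".split()
word_stats, bigram_stats = frequency_profile(toy)
```

Even on this tiny sample a larger share of bigrams than of single words occurs exactly once, which is the tendency the corpus figures above show at scale.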

Representation of the Data 

To represent the data in terms of a statistical model, 
the features of each object are mapped to random vari- 
ables. The relevant features of a bigram are the two 
words that form the bigram. 

If each bigram in the data sample is characterized 
by two features represented by the binary variables X 
and Y, then each bigram will have one of four possible 
classifications corresponding to the possible combina- 
tions of these variable values. In this case, the data is 
said to be cross-classified with respect to the variables 
X and Y. The frequency of occurrence of these classi- 
fications can be shown in a square table having 2 rows 
and 2 columns. The frequency counts of each of the 4
possible data classifications in Figure 3 are denoted by
$n_{11}$, $n_{12}$, $n_{21}$, and $n_{22}$.

The joint frequency distribution of X and Y is described
by the counts $\{n_{ij}\}$ for the data sample represented
in the contingency table. The marginal distributions
of X and Y are the row and column totals (equation 1)
obtained by summing the joint frequencies. The row totals
are denoted $n_{i+}$ and the column totals $n_{+j}$. The
subscript + indicates the index over which summing has
occurred.

$$n_{i+} = \sum_{j} n_{ij} \qquad n_{+j} = \sum_{i} n_{ij} \qquad (1)$$

More generally, if there are I possible values for the 
first variable and J possible values for the second vari- 
able, then the frequency of each classification can be 
recorded in a rectangular table having I rows and J 
columns. Each cell of this table represents one of the
$I \times J$ possible combinations of the variable values. Such
a table is called an $I \times J$ contingency table.

Figure 3: Contingency Table

As shown in Figure 3, in order to study the associ- 
ation (i.e., degree of dependence) between the words 
oil and industry, the variable X is used to denote 
the presence or absence of oil in the first position of 
each bigram, and Y is used to denote the presence or 
absence of industry in the second position. 
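
The cross-classification described above can be sketched in code. This is a minimal Python illustration under stated assumptions: the function names `bigram_table` and `marginals` and the toy token list are hypothetical, and the counts are not those of Figure 3.

```python
def bigram_table(tokens, first, second):
    """Cross-classify every bigram by X (first word == `first`)
    and Y (second word == `second`). Returns [[n11, n12], [n21, n22]]."""
    table = [[0, 0], [0, 0]]
    for w1, w2 in zip(tokens, tokens[1:]):
        row = 0 if w1 == first else 1   # X: presence/absence in position 1
        col = 0 if w2 == second else 1  # Y: presence/absence in position 2
        table[row][col] += 1
    return table

def marginals(table):
    """Row totals n_{i+}, column totals n_{+j}, and sample size n_{++}."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return row_totals, col_totals, sum(row_totals)

# Toy data, for illustration only.
toy = "the oil industry and the steel industry and the oil price".split()
table = bigram_table(toy, "oil", "industry")
rows, cols, total = marginals(table)
```

Here `table[0][0]` plays the role of $n_{11}$, the number of times the bigram itself occurs, and the other three cells count the bigrams in which one or both of the words are absent.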

Significance Testing 

In both exact and asymptotic significance testing, a 
probabilistic model is used to describe the distribution
of the population from which the data sample
was drawn. The acceptability of a potential population
model is postulated as a null hypothesis and that
hypothesis is tested by evaluating the fit of the model 
to the data sample. The fit is considered acceptable if 
the model differs from the data sample by an amount 
consistent with sampling variation, that is, if the value 
of the metric measuring the fit of the model is statis- 
tically significant. 

The steps involved in performing a significance test 
are listed below and discussed in the subsections that 
follow. Both exact and asymptotic significance tests 
follow steps 1 and 2. 

1. Select an appropriate sampling plan, 

2. hypothesize a population model, 

3. select a summary statistic to use in testing the fit of 
the hypothesized model to the sampled data, and 

4. assess the statistical significance of the model: de- 
termine the probability that the data came from a 
population described by the model. 

Steps 3 and 4 are more commonly associated with 
an asymptotic significance test. An exact test does not 
use a goodness of fit statistic but the notion of assessing 
significance still remains. The differences between the 
two approaches will be discussed in more detail shortly. 

Sampling Plan 

In order for a significance test to yield valid results the 
data must be collected from the population via a ran- 
dom sampling plan. The sampling plan assures that 
each object in the data sample is selected via indepen- 
dent and identical trials. The sampling plan together 
with the population characteristics can be used to de- 
fine the likelihood of selecting any particular sample. 

In the experiment for this paper a multinomial sam- 
pling plan was used. In multinomial sampling the over- 
all sample size is determined in advance and each 
object is randomly selected from the population to be 
studied. Given this, the probability of observing a particular
frequency distribution $\{n_{ij}\}$ in a randomly selected
data sample is shown in equation (2), where the
$P_{ij}$'s are the population characteristics specifying the
probability of classification $(i, j)$.

$$P(\{n_{ij}\}) = \frac{n_{++}!}{\prod_{i,j} n_{ij}!} \prod_{i,j} P_{ij}^{\,n_{ij}} \qquad (2)$$

The data in Figure 3 was sampled using a multinomial
sampling plan. This data is used to test the
bigram oil industry for association. When this data 
was sampled, the only value that was fixed prior to the 
beginning of the experiment was the total sample size, 
which was equal to 1,382,828. 

Hypothesizing a Model 

The population model used to study association between
two words, where the two words are represented
by the binary variables X and Y, is the model for
independence between X and Y:

$$P(x, y) = P(x)\,P(y) \qquad (3)$$



If the model for independence fits the data well as 
measured by its statistical significance, then one can 
infer from this data sample that these two words are 
independent in the larger population. The worse the 
fit, the more dependent the words are judged to be. 

Using the notation introduced previously, the parameters
of the model for independence between two
words (i.e., the words oil and industry in Figure 3)
are estimated as follows:

$$\hat{P}(x_i) = \frac{n_{i+}}{n_{++}} \qquad \hat{P}(y_j) = \frac{n_{+j}}{n_{++}} \qquad (4)$$

In significance testing, the population model is the
null hypothesis that is tested. This hypothesis can only
be rejected or accepted; it cannot be proven true or
false with absolute certainty. The significance assigned 
to the hypothesis indicates how likely it is that the 
sample was drawn from a population specified by that 
model. 

Goodness of Fit Statistics

A goodness of fit statistic
is used to measure how closely the counts observed
in a data sample correspond to those that would be ex- 
pected in a random sample drawn from a population 
where the null hypothesis is true. 

In this section, we discuss three metrics that have 
been used to measure the fit of the models for association:
the likelihood ratio statistic $G^2$, Pearson's $X^2$
statistic and the t-statistic. The distribution of each
of these statistics can be approximated when the hy- 
pothesis is true and certain other conditions hold; they 
therefore can be used in asymptotic significance test- 
ing. In Fisher's exact test a goodness of fit statistic is 
not employed. 

$G^2$ and $X^2$

These statistics measure the divergence
of observed ($n_{ij}$) and expected ($m_{ij}$) sample counts,
where the expectation is based on a hypothetical population
model. These statistics can be conveniently
computed using PROC FREQ of the SAS System.

The first step in calculating either $G^2$ or $X^2$ is to calculate
the expected counts given that the hypothetical
population model is correct. In the model for independence,
maximum likelihood estimates of the expected
counts are formulated as in equation (5), where $m_{ij}$
denotes the expected count in contingency table cell
$(i, j)$:

$$m_{ij} = \frac{n_{i+}\, n_{+j}}{n_{++}} \qquad (5)$$

Using this formulation, $G^2$ and $X^2$ are calculated as:

$$G^2 = 2 \sum_{i,j} n_{ij} \log \frac{n_{ij}}{m_{ij}} \qquad X^2 = \sum_{i,j} \frac{(n_{ij} - m_{ij})^2}{m_{ij}} \qquad (6)$$
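
As a rough illustration of these two statistics, the sketch below computes expected counts under independence and then Pearson's and the likelihood ratio statistics for a contingency table. It is written in Python rather than SAS, the function names are hypothetical, and it assumes the standard textbook definitions (expected count = row total times column total over sample size; zero observed cells contribute nothing to the likelihood ratio sum). It also assumes no marginal total is zero.

```python
import math

def expected_counts(table):
    """Expected counts under independence: row total * column total / n."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(rows)
    return [[r * c / total for c in cols] for r in rows]

def pearson_x2(table):
    """Pearson's statistic: sum of (observed - expected)^2 / expected."""
    m = expected_counts(table)
    return sum((n - e) ** 2 / e
               for n_row, m_row in zip(table, m)
               for n, e in zip(n_row, m_row))

def likelihood_g2(table):
    """Likelihood ratio statistic: 2 * sum of observed * log(observed / expected).
    Cells with an observed count of zero are skipped (their limit is 0)."""
    m = expected_counts(table)
    return 2 * sum(n * math.log(n / e)
                   for n_row, m_row in zip(table, m)
                   for n, e in zip(n_row, m_row) if n > 0)

# Tea Drinker's table with every cup classified correctly.
tea = [[4, 0], [0, 4]]
x2 = pearson_x2(tea)
g2 = likelihood_g2(tea)
```

For this table the two values come out near 8.000 and 11.090, which agrees with the Chi-Square and Likelihood Ratio Chi-Square values that PROC FREQ reports in the appendix for a table with the same margins and a perfectly associated (or perfectly reversed) classification.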

When the hypothetical population model is the true
population model, the distribution of both $G^2$ and $X^2$
converges to $\chi^2$ as the sample size grows large (i.e., the
$\chi^2$ distribution is an asymptotic approximation for the
distributions of $G^2$ and $X^2$). More precisely, $X^2$ and
$G^2$ are approximately $\chi^2$ distributed when the following
conditions regarding the random data sample hold
(Read and Cressie 1988): 

1. the sample size is large, 

2. the number of cells in the contingency table repre- 
sentation of the data is fixed and small relative to 
the sample size, and 

3. the expected count (under the hypothetical popula- 
tion model) for each cell is large. 

(Dunning 1993) shows that $G^2$ holds more closely to
the $\chi^2$ distribution than does $X^2$ when dealing with
bigram data. However, as pointed out in (Read and
Cressie 1988), it is uncertain whether $G^2$ holds to the
$\chi^2$ distribution when the minimum of the expected values
in a table is less than 1.0. Since low expected
frequencies appear to be the rule in bigram data (e.g.
column $m_{11}$ in Figure 8) we suggest that the reliability
of the $\chi^2$ approximation to $G^2$ could be in question.

The t-statistic

The t-statistic (equation 7) measures
the difference between the mean of a randomly drawn
sample ($\bar{x}$) and the hypothesized mean for the population
from which that sample was drawn ($\mu_0$). This
difference is scaled by the variance of the population.
When the variance of the population is unknown and
the sample size is large, standard statistical techniques
allow that the population variance can be estimated
by the sample variance ($s^2$) which is in turn scaled by
the sample size ($n$).

$$t = \frac{\bar{x} - \mu_0}{\sqrt{s^2 / n}} \qquad (7)$$

(Church et al. 1991) show how the t-statistic can be
used to identify dependent bigrams. The data sample
is produced through a series of Bernoulli trials
that record the presence or absence of a single bigram.
Given this sampling plan the sample mean is defined
to be the relative frequency of the bigram ($n_{11}/n_{++}$) and
the sample variance is roughly approximated by that
same relative frequency. The t-statistic can then be
rewritten as in equation 8.

$$t \approx \frac{\frac{n_{11}}{n_{++}} - \frac{n_{1+}\, n_{+1}}{n_{++}^2}}{\sqrt{n_{11} / n_{++}^2}} \qquad (8)$$
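
The rewritten t-statistic can be sketched as follows. This Python fragment is an illustration under assumptions, not the SAS data step used in the experiment: the function name `t_statistic` is hypothetical, the hypothesized mean is assumed to be the product of the two words' relative frequencies (the independence model), and the bigram is assumed to occur at least once so that the denominator is nonzero.

```python
import math

def t_statistic(n11, n1p, np1, npp):
    """Approximate t-statistic for a bigram.

    n11: count of the bigram; n1p: count of the first word in position 1;
    np1: count of the second word in position 2; npp: sample size.
    """
    sample_mean = n11 / npp                 # relative frequency of the bigram
    mu0 = (n1p / npp) * (np1 / npp)         # assumed mean under independence
    std_err = math.sqrt(sample_mean / npp)  # variance ~ mean, scaled by sample size
    return (sample_mean - mu0) / std_err

# Small worked example with hypothetical counts.
t = t_statistic(4, 4, 4, 8)
```

With these counts the sample mean is 0.5, the hypothesized mean is 0.25 and the scaled standard deviation is 0.25, so the statistic equals 1.0.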

In the t-test, significance is assigned to the t-statistic
using the t-distribution, which approaches the standard
normal distribution in large sample experiments.
This approach to assigning significance is based on the 
assumption that the sample means are normally dis- 
tributed. This assumption is shown to be inappropri- 
ate for bigram data in (Dunning 1993). 

The formulation of (Church et al. 1991) is equivalent
to a one-sample t-test. PROC TTEST of the SAS 
System computes a two sample t-test and was not used 
to compute the t-statistic values. Instead a separate 
data step was created to calculate the value in equation 
8 and significance was assigned to that value using the 
PROBT function. 

Assessing Statistical Significance 

If the test statistic used to evaluate a model has a 
known distribution when the model is correct, that dis- 
tribution can be used to assign statistical significance. 
For $X^2$ and $G^2$ the $\chi^2$ distribution is used while the
t-test uses the t-distribution. These serve as reliable 
approximations of distributions of the test statistics 
when certain assumptions hold. However, as has been 
pointed out, these assumptions are frequently violated 
in bigram data. 
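
For a 2 x 2 table the reference distribution has one degree of freedom, and in that special case the asymptotic significance value can be written in closed form using the complementary error function. The following Python sketch relies on that identity; the function name `chi2_sf_1df` is a hypothetical helper, not a SAS function.

```python
import math

def chi2_sf_1df(stat):
    """Upper-tail probability P(chi-square > stat) for 1 degree of freedom.

    Uses the identity P(chi2_1 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)),
    where Z is a standard normal variable.
    """
    return math.erfc(math.sqrt(stat / 2.0))

# Statistic values taken from the PROC FREQ output in the appendix.
p_pearson = chi2_sf_1df(8.000)
p_likelihood = chi2_sf_1df(11.090)
```

Rounded to three places these give 0.005 and 0.001, matching the Prob column in the appendix, while Fisher's exact test assigns the same table a two-tailed value of 0.029.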

An alternative to using a significance test based on 
an approximate distribution is to use an exact signifi- 
cance test. In particular, for bigram data Fisher's ex- 
act test is recommended. This test can be performed 
using PROC FREQ in the SAS System. 

Fisher's Exact Test 

Rather than using an asymptotic approximation of the
significance of observing a particular table, Fisher's exact
test (Fisher 1935) computes the significance of an
observed table by exhaustively computing the proba- 
bility of every table that would lead to the marginal 
totals that were observed in the sampled table. 

The significance values obtained using Fisher's exact 
test are reliable regardless of the distributional charac- 
teristics of the data sample. However, when the num- 
ber of comparable data samples is large, the exhaustive 
enumeration performed in Fisher's exact test becomes 
infeasible. In (Pedersen et al. 1996) an alternative 
test, the exact conditional test, is discussed for tables 
where Fisher's exact test is not a practical option. 

When performing Fisher's exact test in a 2 x 2 contingency
table the marginal totals $n_{1+}$ and $n_{+1}$ and
the sample size $n_{++}$ are fixed at their observed values.
Given this, the value of $n_{11}$ determines the counts in
$n_{12}$, $n_{21}$ and $n_{22}$. All of the possible 2 x 2 tables that
adhere to the fixed marginal totals are generated and 
the probability of each table is computed using the hy- 
pergeometric distribution. 

Given that all the marginals and the sample size
are fixed, the hypergeometric probability of observing
a particular frequency distribution $\{n_{11}, n_{12}, n_{21}, n_{22}\}$
can be computed using equation 9. Hypergeometric
probabilities can be computed with the SAS System
using the PROBHYPR function. PROBHYPR was
used to compute the individual table probabilities the
author added to the PROC FREQ output in the appendix
and the data plotted in Figure 5.

$$P = \frac{n_{1+}!\, n_{2+}!\, n_{+1}!\, n_{+2}!}{n_{++}!\, n_{11}!\, n_{12}!\, n_{21}!\, n_{22}!} \qquad (9)$$
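
The hypergeometric probability of a single table can also be sketched outside SAS. The Python function below is an illustrative stand-in (the name `table_probability` is hypothetical); it uses binomial coefficients, which is algebraically equivalent to the factorial form when the margins are fixed.

```python
from math import comb

def table_probability(n11, n1p, np1, npp):
    """Hypergeometric probability of the 2 x 2 table determined by n11,
    given fixed row total n1p, column total np1 and sample size npp."""
    return comb(n1p, n11) * comb(npp - n1p, np1 - n11) / comb(npp, np1)

# Margins of 4 and a sample size of 8 (eight cups, four of each kind).
tea_probs = [table_probability(k, 4, 4, 8) for k in range(5)]
```

Rounded to three places the five values are .014, .229, .514, .229 and .014, and they sum to one because the five tables exhaust the possibilities under the fixed margins.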

The original problem that Fisher used to present 
this test has gone down in statistical lore as the Tea 
Drinker's Problem. A woman claimed that by tasting a 
cup of tea with milk she could determine if the milk or 
tea had been added first. An experiment was designed 
where eight cups of tea were mixed, four cups with the 
milk added first and four with the tea added first. The 
eight cups of tea were presented to the woman in ran- 
dom order and she was asked to divide the 8 cups into 
two sets of 4, one set being those cups where milk was 
added first and the other those where tea was added 
first. 

Given that the lady knew that there were 8 cups of 
tea and that 4 of the cups had the tea added first and 
4 had the milk added first it is clear that all marginal 
totals should be fixed at 4 and the sample size fixed at 
8. Given this there are 5 possible outcomes to the experiment
($n_{11}$ = 0, 1, 2, 3, and 4). Figure 5 shows the
distributions of the hypergeometric probabilities asso- 
ciated with those 5 possible tables. This problem can 
be represented using a 2 x 2 contingency table where 
variable X represents the actual order of mixing and 
Y represents the order determined by the tea drinker. 
The 5 possible contingency tables and associated test 
values as generated by PROC FREQ are shown in the 
appendix. 

Interpreting the Tea Drinker's Problem 

The probabilities that result from Fisher's exact test 
indicate how likely it is that the observed table was 
drawn from a population where the null hypothesis is 
true. In other words, the test indicates how likely it 
would be to randomly sample a table more supportive 
of the null hypothesis than the observed table. In the 
Tea Drinker's Problem the null hypothesis is that the 
tea drinker is guessing and does not really know if the 
milk or tea was added first. This is the hypothesis of 
independence and is the same hypothesis used in the 
test for association that identifies dependent bigrams. 

Fisher's exact test can be interpreted as a one sided 
or two sided test. PROC FREQ shows all of the pos- 
sible results: two-sided, right-sided and left-sided. 

A one sided test can be either right or left sided.
A right sided exact test is computed by summing the
hypergeometric probabilities of all the tables with fixed
marginal totals whose cell count in $n_{11}$
is greater than or equal to that of the observed table. As an
example consider the Tea Drinker's Problem where $n_{11}$
= 3. This implies that the tea drinker found 3 of the
4 cups where milk had been added first. To compute
the significance of the right sided test the probabilities
of the tables where $n_{11} = 3$ (.229) and $n_{11} = 4$ (.014)
are summed. In this interpretation it is determined
how likely it is that the tea drinker could perform more
accurately in the experiment if it were repeated. A right
sided test shows how likely it would be to randomly
sample a table where $n_{11}$ is greater than or equal to
the observed value when sampling from a population
where the null hypothesis is true. The probability of
being more accurate (i.e. the right sided probability)
is .243 which leads to the conclusion that she is not
guessing and has some idea of whether the milk or tea
was added first.

Figure 5: $n_{++} = 8$, $n_{1+} = 4$, $n_{+1} = 4$

The left sided test is computed in the same fashion, 
except that it sums the probabilities of the tables where 
the count in $n_{11}$ is less than or equal to the observed
value. Using the same example where $n_{11} = 3$ then
the probabilities of the tables where $n_{11} = 3$ (.229),
$n_{11} = 2$ (.514), $n_{11} = 1$ (.229), and $n_{11} = 0$ (.014) are
summed resulting in a left sided value of .986. The left
sided test tells how likely it would be for the tea drinker 
to perform less accurately in the same experiment if it 
was repeated. Again, this is a fairly strong indication 
that the tea drinker is not simply guessing. 

The two sided exact test is computed by summing
the hypergeometric probabilities of all tables with the
same fixed marginals whose probabilities are less
than or equal to the probability of the observed table.
Consider yet again the case where $n_{11} = 3$. The
probability of this table is .229. The other tables that have
a probability less than or equal to this are those where $n_{11} = 1$
(.229), $n_{11} = 0$ (.014) and $n_{11} = 4$ (.014). Together with
the observed table, the sum of these probabilities is .486,
which is the result of the two
sided exact test. The question answered here is how
probable it would be for the tea drinker to guess less
accurately than was observed. In more general terms
this asks how likely it would be to randomly sample a
table whose probability is less than or equal to that of
the observed table when sampling from a population
where the hypothesis of independence is true. On the
surface this is a less convincing demonstration of the
tea drinker's skill; however, it still provides reasonable
evidence to support the conclusion that the tea drinker
is not guessing.

Figure 7: $n_{++} = 10000$, $n_{1+} = 5000$, $n_{+1} = 10$
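
The three versions of the exact test described in this section can be sketched together. The Python code below is an illustration only (the names `table_probs` and `fisher_exact` are hypothetical and this is not how PROC FREQ is implemented); exact rational arithmetic is used so that the "less than or equal to the observed probability" comparison in the two sided test is not disturbed by rounding.

```python
from fractions import Fraction
from math import comb

def table_probs(n1p, np1, npp):
    """Exact hypergeometric probability of every table with the given margins,
    keyed by the value of n11."""
    lo = max(0, n1p + np1 - npp)
    hi = min(n1p, np1)
    denom = comb(npp, np1)
    return {k: Fraction(comb(n1p, k) * comb(npp - n1p, np1 - k), denom)
            for k in range(lo, hi + 1)}

def fisher_exact(n11, n1p, np1, npp):
    """Return (left sided, right sided, two sided) exact test values."""
    probs = table_probs(n1p, np1, npp)
    observed = probs[n11]
    left = sum(p for k, p in probs.items() if k <= n11)
    right = sum(p for k, p in probs.items() if k >= n11)
    two_sided = sum(p for p in probs.values() if p <= observed)
    return left, right, two_sided

# Tea Drinker's Problem with 3 of the 4 milk-first cups identified.
left, right, two_sided = fisher_exact(3, 4, 4, 8)
```

Converted to decimals the three values are approximately .986, .243 and .486, the figures worked out by hand above. Calling `fisher_exact(0, 4, 4, 8)` gives .014, 1.000 and .029, which matches the Left, Right and 2-Tail lines of the last PROC FREQ listing in the appendix.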

Interpreting the Bigram Experiment 

The sample sizes in the bigram data are quite a bit 
larger than those in the Tea Drinker's Problem. In 
addition, the bigram data is much more skewed. How- 
ever, Fisher's exact test remains a practical option 
since the number of possible tables is bounded by the
smallest marginal total (i.e., the smallest row or column
total) which for bigram data is associated with
the overall count of the first or second word in the bi- 
gram. 

In the bigram data it would be more typical to find
a sample size of 10,000 where $n_{1+}$ = 20 and $n_{+1}$ = 10.
This implies that the column totals $n_{+1}$ and $n_{+2}$ are 10
and 9,990 respectively. In this case there are 11 possible
tables where $n_{11}$ would range from 0 to 10. The
distribution of hypergeometric probabilities for these
11 possible tables is shown in Figure 6.

The practical effect of the skewed distribution shown
in Figure 6 is that the right sided and two sided exact
tests for association are equivalent. The right-sided
exact test value is the probability of observing a table
with $n_{11}$ greater than or equal to the observed. The
two-sided exact test value is the probability of observing
a table with a probability value less than or equal
to the observed. These are identical since the value of
$P(n_{11})$ decreases as $n_{11}$ increases.

For example, if $n_{11} > 2$ the probability of observing
a table with $n_{11} > 2$ equals .000. Thus, if $n_{11}$ is greater
than 2 then there is no probability that the two words
in the bigram are independent. They must be related.
Notice however that if the observed count is anything
but 0, a very small probability of independence is observed.
Such skewed probabilities are observed for tables
where $n_{1+} \ll n_{2+}$ and $n_{+1} \ll n_{+2}$. These sorts
of tables are what was observed with the bigram data.

Notice that this was not the case in the Tea Drinker's
Problem and would not be the case when the row totals
are closer or equal in value. Consider the example
shown in Figure 7 where $n_{++} = 10000$, $n_{1+} = 5000$
and $n_{+1} = 10$. It is easy to see that in this case
$n_{1+} = n_{2+}$. In this case the distribution of the hypergeometric
probabilities is symmetric and the right and
two-sided tests are different.
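
The contrast between skewed and symmetric margins can be checked numerically. The Python sketch below is an illustration under assumptions: the function name is hypothetical, the symmetric margins are those stated for Figure 7, and the skewed margins (a sample of 10,000 with a row total of 20 and a column total of 10) are assumed values chosen to resemble typical bigram data.

```python
from fractions import Fraction
from math import comb

def right_and_two_sided(n11, n1p, np1, npp):
    """Right sided and two sided exact test values for the table fixed by n11."""
    lo, hi = max(0, n1p + np1 - npp), min(n1p, np1)
    denom = comb(npp, np1)
    probs = {k: Fraction(comb(n1p, k) * comb(npp - n1p, np1 - k), denom)
             for k in range(lo, hi + 1)}
    right = sum(p for k, p in probs.items() if k >= n11)
    two_sided = sum(p for p in probs.values() if p <= probs[n11])
    return right, two_sided

# Skewed margins (assumed): probabilities fall as n11 grows,
# so the two tests agree for every possible observed count.
skewed_agree = all(
    r == t for r, t in (right_and_two_sided(k, 20, 10, 10000) for k in range(11))
)

# Symmetric margins as in Figure 7: the two sided value also collects
# the mirror-image tables in the lower tail.
sym_right, sym_two = right_and_two_sided(7, 5000, 10, 10000)
```

In the symmetric case with an observed count of 7, the two sided value is exactly twice the right sided value, because the tables with counts 0 through 3 are as improbable as those with counts 7 through 10.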

In the test for association the marginal row totals
$n_{1+}$ and $n_{2+}$ are never very close in value. $n_{1+}$ counts
how many times the first word in the bigram occurs
in the text overall while $n_{2+}$ is the count of all the
other potential first words in the bigram. Since $n_{1+}$
will always be much less than $n_{2+}$ the distribution of
the hypergeometric probabilities will always be very
skewed.

In the test for association to determine bigram dependence
Fisher's exact test is interpreted as a left-sided
test. This shows how probable it would be to
see the observed bigram fewer times in
another random sample from a population where the
hypothesis of independence is true. If this probability
is high then the words that form the bigram are
dependent.

Experiment: Test for Association^ 

There are two fundamental assumptions that underlie
asymptotic significance testing: (1) the data must be 
collected via a random sampling of the population un- 
der study, and (2) the sample must exhibit certain dis- 
tributional characteristics. If either of these assump- 
tions is not met then the inference procedure may not 
provide reliable results. 

In this experiment, we compare the significance values
computed using the t-test, the $\chi^2$ approximation to
the distribution of both $G^2$ and $X^2$, and Fisher's exact
test (left sided). Our data sample is a 1,382,828 word
subset of the ACL/DCI Wall Street Journal corpus. 
We chose to characterize the associations established 
by the word industry as shown in bigrams of the form 
<word> industry. In Figure 8, we display a subset 
of 24 bigrams and their associated test results. 

As can be seen in Figure 8, there are differences in 
the significance values assigned by the various tests. 
This indicates that the assumptions required by certain 
of these tests are being violated. When this occurs, the 
significance values assigned using Fisher's exact test 
should be regarded as the most reliable since there are 
no restrictions on the nature of the data required by 
this test. 

Figure 8 displays the significance value assigned to 
the test for association between the word shown in col- 
umn one and industry. A significance value of .0000 
implies that this data shows no evidence of indepen- 
dence. The likelihood of having randomly selected this 
data sample from a population where these words were 
independent is zero. This is an indication of a depen- 
dent bigram. A significance value of 1.00 would in- 
dicate that there is an exact fit between the sampled 
data and the model for independence — there is no 
reason to doubt that this sample was drawn from a 
population in which these two words are independent. 
In this case the bigram is considered independent. 

In this figure, we show the relative rankings of the bigrams
according to their significance values. The most
independent bigram is rank 1 and the most dependent
bigram is rank 24. Note that the rankings defined using
Fisher's exact test and the $\chi^2$ approximation to $G^2$
are identical, as are the rankings as determined by the
$\chi^2$ approximation to $X^2$ and the t-test. Notice further
that the significance values assigned by Fisher's exact
test are similar to the values assigned by the $\chi^2$
approximation to $G^2$ for the most dependent bigrams.
However, there is some variation between the significance
computed for Fisher's test and $G^2$ among the
more independent bigrams. This confirms the observation
made by (Dunning 1993) that $G^2$ tends to overstate
independence. This indicates that the asymptotic
approximation of $G^2$ by the $\chi^2$ distribution is breaking
down for those bigrams. In this case Fisher's test
provides a more reliable significance value. The significance
values assigned using the $\chi^2$ approximation
to Pearson's $X^2$ and the t-test are very different from
those assigned by Fisher's exact test. This indicates
that neither $X^2$ nor the t-statistic is holding to its assumed
asymptotic approximation.

Conclusions 

In this paper we examined recent work in identifying 
dependent bigrams. This work has used asymptotic 
significance tests when exact ones would have been 
more appropriate. 

When asymptotic methods are used there are re- 
quirements regarding both the sampling plan and the 
distributional characteristics of the data that must be 
met. If the distributional requirements are not met, as 
is frequently the case in NLP, then Fisher's exact test is 
a viable alternative to asymptotic tests of significance. 
The SAS system allows for convenient computation of 
Fisher's exact test using PROC FREQ.

^Please contact the author at pedersen@seas.smu.edu if

Figure 8: Test for association: <word> industry



Appendix: PROC FREQ output for Tea Drinker's Problem 

The SAS System

TABLE OF X BY Y

Cell contents: Frequency, Expected, Deviation, Percent, Row Pct, Col Pct
X ** by Y ** (columns: milk, tea)
[cell values not recoverable]

STATISTICS FOR TABLE OF X BY Y

[statistic values not recoverable]
P(n11 = 4)    .014

Sample Size = 8
WARNING: 100% of the cells have expected counts less
         than 5. Chi-Square may not be a valid test.
**: Added by the author. Not a part of PROC FREQ output


The SAS System

TABLE OF X BY Y

[cell values not recoverable]

STATISTICS FOR TABLE OF X BY Y

Statistic    DF    Value    Prob
[statistic values not recoverable]

Sample Size = 8
WARNING: 100% of the cells have expected counts less
         than 5. Chi-Square may not be a valid test.
**: Added by the author. Not a part of PROC FREQ output



The SAS System

TABLE OF X BY Y

Cell contents: Frequency, Expected, Deviation, Percent, Row Pct, Col Pct
X ** by Y ** (columns: milk, tea, Total)
[cell values not recoverable]

STATISTICS FOR TABLE OF X BY Y

Statistic    DF    Value    Prob
[statistic values not recoverable]

Sample Size = 8
WARNING: 100% of the cells have expected counts less
         than 5. Chi-Square may not be a valid test.
**: Added by the author. Not a part of PROC FREQ output



The SAS System

TABLE OF X BY Y

[cell values not recoverable]

STATISTICS FOR TABLE OF X BY Y

Statistic    DF    Value    Prob
[statistic values not recoverable]
P(n11 = 1)    [value not recoverable]

Sample Size = 8
WARNING: 100% of the cells have expected counts less
         than 5. Chi-Square may not be a valid test.
**: Added by the author. Not a part of PROC FREQ output



The SAS System

TABLE OF X BY Y

Cell contents: Frequency, Expected, Deviation, Percent, Row Pct, Col Pct
X ** by Y ** (columns: milk, tea, Total)
[cell values not recoverable]

STATISTICS FOR TABLE OF X BY Y

Statistic                       DF     Value     Prob
Chi-Square                       1     8.000    0.005
Likelihood Ratio Chi-Square      1    11.090    0.001
Continuity Adj. Chi-Square       1     4.500    0.034
Mantel-Haenszel Chi-Square       1     7.000    0.008
Fisher's Exact Test (Left)                      0.014
                    (Right)                     1.000
                    (2-Tail)                    0.029
P(n11 = 0)                                      0.014 **

Phi Coefficient                       -1.000
Contingency Coefficient                0.707
Cramer's V                            -1.000

Sample Size = 8
WARNING: 100% of the cells have expected counts less
         than 5. Chi-Square may not be a valid test.
**: Added by the author. Not a part of PROC FREQ output